(Partially abridged from Alison Hill’s CS631 Labs)
This lab uses the dplyr functions learned at the beginning of the
course, together with ggplot2, to recreate a few graphs.
We’ll use data from New York’s Museum of Modern Art (MoMA).
The journey of a data scientist begins with some housekeeping on the input data, which is rarely ready-to-use but very often needs to be cleaned and prepared for the analyses.
library(tidyverse)
library(janitor)
moma <- read_csv("data_artworks.csv",
  col_types = cols(
    BeginDate = col_number(), EndDate = col_number(),
    `Length (cm)` = col_number(), `Circumference (cm)` = col_number(),
    `Duration (sec.)` = col_number(), `Diameter (cm)` = col_number()
  )) %>%
  clean_names()
problems(moma)
We do some basic cleaning with stringr on the
gender variable, which refers to the gender of the artist
(a () is used as a placeholder for “various artists”).
library(stringr)
moma <- moma %>%
  mutate(
    # recode "(female)" / "(male)" to single letters, ignoring case
    gender = str_replace_all(gender, fixed("(female)", ignore_case = TRUE), "F"),
    gender = str_replace_all(gender, fixed("(male)", ignore_case = TRUE), "M"),
    # each remaining letter stands for one artist; 0 letters becomes NA
    num_artists = str_count(gender, "[:alpha:]"),
    num_artists = na_if(num_artists, 0),
    n_female_artists = str_count(gender, "F"),
    n_male_artists = str_count(gender, "M"),
    # label solo artists only; every other case stays NA
    artist_gender = case_when(
      num_artists == 1 & n_female_artists == 1 ~ "Female",
      num_artists == 1 & n_male_artists == 1 ~ "Male"
    )
  )
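To see what the recoding does, here is a minimal sketch on a handful of made-up gender strings. The values are assumptions chosen to mimic the raw column, not rows taken from the MoMA file.

```r
library(stringr)

# Hypothetical raw values: one artist, two artists, the "()" placeholder, missing
gender <- c("(Male)", "(Female)", "(Male) (Female)", "()", NA)

# Same two replacements as above: "(female)" first, then "(male)"
recoded <- str_replace_all(gender, fixed("(female)", ignore_case = TRUE), "F")
recoded <- str_replace_all(recoded, fixed("(male)", ignore_case = TRUE), "M")
recoded
# "M" "F" "M F" "()" NA

# Counting letters gives the number of artists with a recorded gender
num_artists <- str_count(recoded, "[:alpha:]")
num_artists
# 1 1 2 0 NA
```

The "()" placeholder yields a count of 0, which is why the pipeline above turns 0 into NA with na_if().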
Let’s also do some detecting of strings in the
credit_line variable.
moma <- moma %>%
  mutate(
    purchase = str_detect(credit_line, fixed("purchase", ignore_case = TRUE)),
    gift = str_detect(credit_line, fixed("gift", ignore_case = TRUE)),
    exchange = str_detect(credit_line, fixed("exchange", ignore_case = TRUE))
  )
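As a small illustration of the detection step, the sketch below runs str_detect() on a few invented credit lines (the strings are assumptions, not actual MoMA credit lines).

```r
library(stringr)

# Hypothetical credit lines, including an upper-case one and a missing one
credit_line <- c("Gift of the artist", "PURCHASE", "Acquired through exchange", NA)

# fixed(..., ignore_case = TRUE) matches the literal word regardless of case
is_gift <- str_detect(credit_line, fixed("gift", ignore_case = TRUE))
is_purchase <- str_detect(credit_line, fixed("purchase", ignore_case = TRUE))

is_gift
# TRUE FALSE FALSE NA
is_purchase
# FALSE TRUE FALSE NA
```

Because each flag is computed independently, a single credit line can be TRUE for more than one of purchase, gift, and exchange, and a missing credit line gives NA rather than FALSE.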
Let’s clean up some dates:
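year_acquired: use lubridate to pull out
the year from date_acquired.
year_created: use stringr::str_extract() to pull the first four-digit number out of date.
library(lubridate)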
moma <- moma %>%
  mutate(year_acquired = year(date_acquired)) %>%
  rename(artist_birth_year = begin_date, artist_death_year = end_date) %>%
  mutate(
    year_created = str_extract(date, "\\d{4}"),
    artist_birth_year = na_if(artist_birth_year, 0),
    artist_death_year = na_if(artist_death_year, 0)
  )
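The following sketch shows how the two date helpers behave on a few invented inputs. The date strings are assumptions about what a free-text date field might contain.

```r
library(stringr)
library(lubridate)

# Hypothetical free-text creation dates
date_text <- c("1935", "c. 1917", "1935-36", "Unknown")

# str_extract() returns the first run of four digits, or NA if there is none
year_created <- str_extract(date_text, "\\d{4}")
year_created
# "1935" "1917" "1935" NA

# The result is still character, so convert before doing arithmetic or plotting
as.numeric(year_created)
# 1935 1917 1935 NA

# lubridate::year() pulls the year out of a proper Date
year(as.Date("1936-01-15"))
# 1936
```

This is why the plotting code further down wraps year_created in as.numeric().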
What different kinds of art classifications are available?
moma %>%
distinct(classification) %>%
print(n = Inf)
## # A tibble: 31 × 1
## classification
## <chr>
## 1 Architecture
## 2 Mies van der Rohe Archive
## 3 Design
## 4 Illustrated Book
## 5 Print
## 6 Drawing
## 7 Film
## 8 Multiple
## 9 Periodical
## 10 Photograph
## 11 Painting
## 12 (not assigned)
## 13 Architectural Model
## 14 Product Design
## 15 Video
## 16 Media
## 17 Performance
## 18 Sculpture
## 19 Photography Research/Reference
## 20 Software
## 21 Installation
## 22 Work on Paper
## 23 Audio
## 24 Textile
## 25 Ephemera
## 26 Collage
## 27 Film (object)
## 28 Frank Lloyd Wright Archive
## 29 Poster
## 30 Graphic Design
## 31 Furniture and Interiors
We want to focus on standard rectangular paintings:
Keep only works whose classification is “Painting”.
Drop works with missing (NA) height or width measurements, or that have 0 for either
height or width.
library(tidyr)
moma <- moma %>%
filter(classification == "Painting") %>%
drop_na(height_cm, width_cm) %>%
filter(height_cm > 0 & width_cm > 0)
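A minimal sketch of the same three filtering steps on a toy table may help. The four rows are made up so that each rule removes exactly one of them.

```r
library(dplyr)
library(tidyr)

# Hypothetical works: one valid painting, one with a missing height,
# one with a zero height, and one that is not a painting
works <- tibble(
  title = c("a", "b", "c", "d"),
  classification = c("Painting", "Painting", "Painting", "Drawing"),
  height_cm = c(100, NA, 0, 50),
  width_cm = c(80, 40, 30, 20)
)

kept <- works %>%
  filter(classification == "Painting") %>%  # drops "d"
  drop_na(height_cm, width_cm) %>%          # drops "b"
  filter(height_cm > 0 & width_cm > 0)      # drops "c"

kept$title
# "a"
```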
We focus only on a subset of columns:
moma <- moma %>%
select(title, contains("artist"), contains("year"), contains("_cm"),
purchase, gift, exchange, classification, department)
Now let’s export this data frame, in case we want to start right away from the cleaned data.
write_csv(moma, "artworks-cleaned.csv")
As you can see, we did a lot of cleaning and decision-making in the pre-processing. The data we have now contain only paintings in the MoMA collection.
If you start working from the cleaned data, you just load them from the saved CSV file:
library(here)
library(readr)
library(dplyr)
moma <- read_csv("artworks-cleaned.csv")
You cleaned and prepared the data: now it’s time to know your data and start asking questions.
For example:
How many paintings (rows) are in moma? How many
variables (columns) are in moma?
And more.
Let’s see how we can answer some of these questions!
How many rows and columns are in moma? These questions can be answered, for example, using the
dplyr::glimpse() function.
moma
glimpse(moma)
## Rows: 2,253
## Columns: 23
## $ title <chr> "Rope and People, I", "Fire in the Evening", "Portra…
## $ artist <chr> "Joan Miró", "Paul Klee", "Paul Klee", "Pablo Picass…
## $ artist_bio <chr> "(Spanish, 1893–1983)", "(German, born Switzerland. …
## $ artist_birth_year <dbl> 1893, 1879, 1879, 1881, 1880, 1879, 1943, 1880, 1839…
## $ artist_death_year <dbl> 1983, 1940, 1940, 1973, 1946, 1953, 1977, 1950, 1906…
## $ num_artists <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ n_female_artists <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_male_artists <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ artist_gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ year_acquired <dbl> 1936, 1970, 1966, 1955, 1939, 1968, 1997, 1931, 1934…
## $ year_created <chr> "1935", "1929", "1927", "1919", "1925", "1919", "197…
## $ circumference_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ depth_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ diameter_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ height_cm <dbl> 104.8, 33.8, 60.3, 215.9, 50.8, 129.2, 200.0, 54.6, …
## $ length_cm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ width_cm <dbl> 74.6, 33.3, 36.8, 78.7, 54.0, 89.9, 200.0, 38.1, 96.…
## $ seat_height_cm <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ purchase <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ gift <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, F…
## $ exchange <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALS…
## $ classification <chr> "Painting", "Painting", "Painting", "Painting", "Pai…
## $ department <chr> "Painting & Sculpture", "Painting & Sculpture", "Pai…
There are 2253 paintings in moma.
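If only the dimensions are needed, base R offers a shorter route than reading them off the glimpse() output. The sketch below uses a made-up three-row table rather than moma.

```r
library(dplyr)

# Hypothetical stand-in for the real data
toy <- tibble(title = c("a", "b", "c"), height_cm = c(10, 20, 30))

n_rows <- nrow(toy)  # number of observations
n_cols <- ncol(toy)  # number of variables
dim(toy)             # both at once: rows, then columns
# 3 2
```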
What was the first painting acquired, and in which year? Hint: this can be answered by combining two
dplyr functions, select and
arrange.
moma %>%
select(artist, title, year_acquired) %>%
arrange(year_acquired)
What is the oldest painting in the collection? Again, this can be answered by combining the
dplyr functions select and
arrange.
moma %>%
select(artist, title, year_created) %>%
arrange(year_created)
To do inline comments, I could say that the oldest painting is Landscape at Daybreak, painted by Odilon Redon in 1872.
How many distinct artists are there?
moma %>%
distinct(artist)
Pro tip: You could add a tally() too to get just the
number of rows. You can also then use pull() to get that
single number out of the tibble:
num_artists <- moma %>%
distinct(artist) %>%
# tally() is short for df %>% summarise(n = n())
tally() %>%
pull()
num_artists
## [1] 989
Then I can refer to this number in inline code chunks like: there are 989 total.
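An alternative that gets the same single number in one step is dplyr::n_distinct(). The sketch below compares both routes on a made-up table with three distinct artists.

```r
library(dplyr)

# Hypothetical data: artist "A" appears twice
toy <- tibble(
  artist = c("A", "A", "B", "C"),
  title = c("t1", "t2", "t3", "t4")
)

# Route used above: distinct rows, then count them, then pull the number out
via_tally <- toy %>%
  distinct(artist) %>%
  tally() %>%
  pull()

# Shorter route: count distinct values directly
via_n_distinct <- n_distinct(toy$artist)

via_tally
# 3
via_n_distinct
# 3
```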
Which artist has the most paintings in moma?
moma %>%
count(artist, sort = TRUE)
In the ?count documentation, it says:
“count and tally are designed so that you can
call them repeatedly, each time rolling up a level of detail.” Try
running count() again (leave parentheses empty) on your
last code chunk.
moma %>%
count(artist, sort = TRUE) %>%
count()
How many paintings are there by artist gender?
moma %>%
count(artist_gender)
Now we’ll count the number of paintings by artist within each gender. You’ll need to give
count two variable names in the parentheses:
artist_gender and artist.
moma %>%
count(artist_gender, artist, sort = TRUE)
This output is not super helpful as we already know that Pablo
Picasso has 55 paintings in the MoMA collection. But how can we find out
which female artist has the most paintings? We have a few options. Let’s
first add a filter for females.
moma %>%
count(artist_gender, artist, sort = TRUE) %>%
filter(artist_gender == "Female")
Another option is to use another dplyr function called
top_n(). Use ?top_n to see how it works. Or
how it won’t work, in this context:
moma %>%
count(artist_gender, artist, sort = TRUE) %>%
top_n(2)
It works better after a
group_by(artist_gender):
moma %>%
count(artist_gender, artist, sort = TRUE) %>%
group_by(artist_gender) %>%
top_n(1)
Now we can see that Sherrie Levine has 12 paintings. This is a pretty far cry from the 55 paintings by Pablo Picasso.
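Since dplyr 1.0.0, top_n() has been superseded by slice_max() and slice_min(). A minimal sketch of the same "top artist per gender" idea with slice_max() follows; the counts are invented for illustration.

```r
library(dplyr)

# Hypothetical per-artist counts, as count(artist_gender, artist) would produce
counts <- tibble(
  artist_gender = c("Male", "Male", "Female", "Female"),
  artist = c("A", "B", "C", "D"),
  n = c(5L, 2L, 3L, 1L)
)

top_by_gender <- counts %>%
  group_by(artist_gender) %>%
  slice_max(n, n = 1) %>%   # first n: the column to rank by; n = 1: rows to keep
  ungroup() %>%
  arrange(artist_gender)

top_by_gender$artist
# "C" "A"
```

By default slice_max() keeps ties, so a group can return more than one row when several artists share the top count.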
How many artists of each gender are there? This is a harder question to answer than you might think, because
the level of observation in our current moma dataset is
unique paintings. We have multiple paintings done by the same
artists though, so counting just the number of unique paintings is
different from counting the number of unique artists.
Remember how count can be used back-to-back to roll up a
level of detail? We try that by running
count(artist_gender) again on the last code chunk.
moma %>%
count(artist_gender, artist) %>%
count(artist_gender)
This output takes the previous table (made with
count(artist_gender, artist)), and essentially ignores the
n column. So we no longer care about how many
paintings each individual artist created. Instead, we want to
count the rows in this new table where each row is
a unique artist. By counting by artist_gender in the last
line, we are grouping by levels of that variable (so
Female/Male/NA) and nn is the number of unique
artists for each gender category recorded.
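The roll-up can be sketched on a toy table as follows. Here name = is passed explicitly so that each count column has a clear label and the example does not depend on how a given dplyr version names the second count column; the data are made up.

```r
library(dplyr)

# Hypothetical paintings: artists A and B are female, C is male
toy <- tibble(
  artist = c("A", "A", "B", "C", "C", "C"),
  artist_gender = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# Level 1: one row per artist, with the number of paintings
per_artist <- toy %>%
  count(artist_gender, artist, name = "n_paintings")

# Level 2: count the rows of that table, i.e. unique artists per gender
per_gender <- per_artist %>%
  count(artist_gender, name = "n_artists")

per_gender$n_artists
# 2 1
```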
In which years were the most paintings acquired, and in which were the most created? This is another job for dplyr::count, which we can also
use to sort by the counts:
moma %>%
count(year_acquired, sort = TRUE)
moma %>%
count(year_created, sort = TRUE)
To answer this question, we combine filter,
select, and arrange from
dplyr.
When was the first painting by a solo female artist acquired?
moma %>%
filter(num_artists == 1 & n_female_artists == 1) %>%
select(title, artist, year_acquired, year_created) %>%
arrange(year_acquired)
What is the oldest painting by a solo female artist, and when was it created?
moma %>%
filter(num_artists == 1 & n_female_artists == 1) %>%
select(title, artist, year_acquired, year_created) %>%
arrange(year_created)
Let’s recreate this plot from fivethirtyeight (mostly)!
Things to consider:
Set an alpha value for the points
here - keep in mind that 0 is totally transparent and
1 is opaque.
Use geom_abline() to add the line in red (use the
default intercept value of 0). The actual red line is difficult to
recreate - here is what the authors say: “The red regression line shows
the “modernizing” of MoMA’s collection — how quickly the museum has
moved toward acquiring recent paintings.”
ggplot(moma, aes(as.numeric(year_created), as.numeric(year_acquired))) +
  geom_point(alpha = 0.3, na.rm = TRUE) +
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  labs(x = "Year Painted", y = "Year Acquired",
       title = "MoMA Keeps Its Collection Current",
       subtitle = "Year of a work's acquisition vs. year it was painted")
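One way to read the red line: a point sits exactly on it when a painting was acquired in the year it was painted, and the further above the line a point is, the longer the museum waited. The sketch below computes that gap directly; the column name years_to_acquire and the three rows are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical paintings; year_created is character, as in the cleaned data
toy <- tibble(
  year_created = c("1950", "1900", NA),
  year_acquired = c(1950, 1960, 1970)
)

toy_lag <- toy %>%
  mutate(years_to_acquire = year_acquired - as.numeric(year_created))

toy_lag$years_to_acquire
# 0 60 NA
```

A gap of 0 corresponds to a point on the line, 60 to a point well above it, and NA to a painting that the plot drops because its creation year is unknown.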
Can you make the same plot above, but facet by artist gender?
For this to make sense, you probably want to do some filtering to select only those paintings where there was one “solo” artist.
moma_solo <- moma %>%
  filter(num_artists == 1)
ggplot(moma_solo, aes(as.numeric(year_created), as.numeric(year_acquired))) +
  geom_point(alpha = 0.1) +
  geom_abline(intercept = 0, slope = 1, colour = "red") +
  labs(x = "Year Painted", y = "Year Acquired") +
  ggtitle("MoMA Keeps Its Collection Current") +
  facet_wrap(~artist_gender)
Let’s (somewhat) try to recreate this scatterplot from fivethirtyeight:
Some things to consider:
You’ll need to compute the height-to-width ratio of each painting with mutate.
Hint: You’ll probably also want to look into case_when
to create a categorical variable “on the fly” to use for coloring.
moma_dim <- moma %>%
  filter(height_cm < 600, width_cm < 760) %>%
  mutate(
    hw_ratio = height_cm / width_cm,
    hw_cat = case_when(
      hw_ratio > 1 ~ "taller than wide",
      hw_ratio < 1 ~ "wider than tall",
      hw_ratio == 1 ~ "perfect square"
    )
  )
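To check how case_when() assigns the three categories, here is a minimal sketch on three invented paintings, one per category.

```r
library(dplyr)

# Hypothetical dimensions: portrait, landscape, and square
toy <- tibble(
  height_cm = c(100, 50, 80),
  width_cm = c(50, 100, 80)
)

toy_dim <- toy %>%
  mutate(
    hw_ratio = height_cm / width_cm,
    # conditions are tried in order; the first TRUE one sets the label
    hw_cat = case_when(
      hw_ratio > 1 ~ "taller than wide",
      hw_ratio < 1 ~ "wider than tall",
      hw_ratio == 1 ~ "perfect square"
    )
  )

toy_dim$hw_cat
# "taller than wide" "wider than tall" "perfect square"
```

Rows that match none of the conditions, for example when the ratio is NA, get NA as their category.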
library(ggthemes) # to load the fivethirtyeight theme
ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
  geom_point(alpha = 0.5) +
  ggtitle("MoMA Paintings, Tall and Wide") +
  scale_colour_manual(name = "", values = c("gray50", "#FF9900", "#B14CF0")) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  labs(x = "Width", y = "Height")
We can do better with colors!
ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
  geom_point(alpha = 0.5) +
  ggtitle("MoMA Paintings, Tall and Wide") +
  scale_colour_manual(name = "", values = c("gray50", "#ee5863", "#6999cd")) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  labs(x = "Width", y = "Height")
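With an unnamed vector, scale_colour_manual() assigns colours to the categories in their sorted order, so "gray50" goes to "perfect square" only because that label sorts first. A named vector makes the mapping explicit. The sketch below assumes a toy data frame and a hypothetical object name hw_colours.

```r
library(ggplot2)

# Hypothetical stand-in for moma_dim with one row per category
toy <- data.frame(
  width_cm = c(50, 100, 80),
  height_cm = c(100, 50, 80),
  hw_cat = c("taller than wide", "wider than tall", "perfect square")
)

# Names tie each colour to a category, independent of ordering
hw_colours <- c(
  "perfect square" = "gray50",
  "taller than wide" = "#ee5863",
  "wider than tall" = "#6999cd"
)

p <- ggplot(toy, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
  geom_point() +
  scale_colour_manual(name = "", values = hw_colours)
p
```

If a category is later renamed or added, the named version leaves it without a colour instead of silently shifting the existing colours.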
We could also remove the legend and use an annotation layer instead:
ggplot(moma_dim, aes(x = width_cm, y = height_cm, colour = hw_cat)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +
  ggtitle("MoMA Paintings, Tall and Wide") +
  scale_colour_manual(name = "", values = c("gray50", "#ee5863", "#6999cd")) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  labs(x = "Width", y = "Height") +
  annotate(x = 200, y = 380, geom = "text", label = "Taller than\nWide",
           color = "#ee5863", size = 5, hjust = 1, fontface = 2) +
  annotate(x = 375, y = 100, geom = "text", label = "Wider than\nTall",
           color = "#6999cd", size = 5, hjust = 0, fontface = 2)
Finally, make a plot of your own from the moma data. It can be anything - you can change colors, add annotations, switch the geoms, or add new variables to examine. The only requirements are:
It does not have to be publication-ready right now, but it should make sense as a visualization.